APP: Accelerated Path Patching with Task-Specific Pruning
Andersen, Frauke, Rudman, William, Zhang, Ruochen, Eickhoff, Carsten
Circuit discovery is a key step in many mechanistic interpretability pipelines. Current methods, such as Path Patching, are computationally expensive and have limited in-depth circuit analysis to smaller models. In this study, we propose Accelerated Path Patching (APP), a hybrid approach that leverages our novel contrastive attention-head pruning method to drastically reduce the search space of circuit discovery methods. Our Contrastive-FLAP pruning algorithm uses techniques from causal mediation analysis to assign higher pruning scores to task-specific attention heads, leading to higher-performing sparse models than traditional pruning techniques. Although Contrastive-FLAP preserves task-specific heads that existing pruning algorithms remove at low sparsity ratios, the circuits found by Contrastive-FLAP alone are too large to satisfy the minimality constraint required in circuit analysis. APP therefore first applies Contrastive-FLAP to reduce the search space required by circuit discovery algorithms by, on average, 56%. Next, APP applies traditional Path Patching to the remaining attention heads, leading to a speed-up of 59.63%-93.27% compared to Path Patching applied to the dense model. Despite these substantial computational savings, the circuits obtained from APP exhibit considerable overlap with, and performance similar to, previously established Path Patching circuits.
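The abstract describes a two-stage pipeline: Contrastive-FLAP first prunes away heads that are not task-specific, and Path Patching is then run only over the survivors. A minimal sketch of that control flow follows; `contrastive_flap_scores`, `path_patch_effect`, `keep_ratio`, and `effect_threshold` are hypothetical stand-ins for the paper's actual scoring, patching, and threshold choices, shown only to illustrate how pruning shrinks the patching search space.

```python
from typing import Callable, List, Tuple

Head = Tuple[int, int]  # (layer, head) index pair


def accelerated_path_patching(
    heads: List[Head],
    contrastive_flap_scores: Callable[[Head], float],  # hypothetical: task-specific importance score
    path_patch_effect: Callable[[Head], float],        # hypothetical: per-head causal effect via path patching
    keep_ratio: float = 0.44,                          # abstract reports ~56% of the search space pruned on average
    effect_threshold: float = 0.05,                    # illustrative cutoff, not from the paper
) -> List[Head]:
    """Two-stage circuit search: prune first, then path-patch only the survivors."""
    # Stage 1: rank heads by the (hypothetical) Contrastive-FLAP score and keep the top fraction.
    ranked = sorted(heads, key=contrastive_flap_scores, reverse=True)
    survivors = ranked[: max(1, int(len(ranked) * keep_ratio))]

    # Stage 2: run the expensive path-patching measurement only on the surviving heads.
    return [h for h in survivors if abs(path_patch_effect(h)) >= effect_threshold]


if __name__ == "__main__":
    # Toy example: a 12x12 head grid (GPT-2 small sized) with random stand-in scores.
    import random

    random.seed(0)
    all_heads = [(l, h) for l in range(12) for h in range(12)]
    flap = {h: random.random() for h in all_heads}
    effect = {h: random.gauss(0.0, 0.1) for h in all_heads}
    circuit = accelerated_path_patching(all_heads, flap.get, effect.get)
    print(f"{len(circuit)} candidate circuit heads out of {len(all_heads)}")
```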
A Principled Loss Function for Direct Language Model Alignment
The alignment of large language models (LLMs) with human preferences is commonly achieved through Reinforcement Learning from Human Feedback (RLHF). Direct Preference Optimization (DPO) simplified this paradigm by establishing a direct mapping between the optimal policy and a reward function, eliminating the need for an explicit reward model. However, we argue that the DPO loss function is theoretically misaligned with its own derivation, as it promotes the indefinite maximization of a logits difference, which can lead to training instability and reward hacking. In this paper, we propose a novel loss function derived directly from the RLHF optimality condition. Our proposed loss targets a specific, finite value for the logits difference, which is dictated by the underlying reward, rather than its maximization. We provide a theoretical analysis, including a gradient-based comparison, to demonstrate that our method avoids the large gradients that plague DPO when the probability of dispreferred responses approaches zero. This inherent stability prevents reward hacking and leads to more effective alignment. We validate our approach by fine-tuning a Qwen2.5-7B model, showing significant win-rate improvements over a standard DPO baseline and achieving competitive performance against larger models like Llama-3.1-8B.
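For reference, the standard DPO objective keeps pushing the logits difference upward without bound, which is the behaviour the abstract critiques; a loss that instead regresses the difference onto a finite target removes that pressure. The sketch below contrasts the well-known DPO loss with one plausible target-matching variant; the squared-error form and the `target` parameter are illustrative assumptions, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO: -log(sigmoid(beta * logits_difference)).

    Minimising this keeps increasing the logits difference indefinitely, the
    behaviour the abstract identifies as a source of instability and reward hacking.
    """
    logits_diff = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * logits_diff).mean()


def target_matching_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected,
                         target, beta=0.1):
    """Illustrative alternative (an assumption, not the paper's exact loss):

    regress the logits difference onto a finite target dictated by the underlying
    reward, so the gradient vanishes once the target is reached instead of growing
    as the probability of the dispreferred response approaches zero.
    """
    logits_diff = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return ((beta * logits_diff - target) ** 2).mean()


if __name__ == "__main__":
    lp_c, lp_r = torch.tensor([-5.0]), torch.tensor([-9.0])
    ref_c, ref_r = torch.tensor([-6.0]), torch.tensor([-7.0])
    print("DPO loss:   ", dpo_loss(lp_c, lp_r, ref_c, ref_r).item())
    print("Target loss:", target_matching_loss(lp_c, lp_r, ref_c, ref_r, target=1.0).item())
```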
From Indirect Object Identification to Syllogisms: Exploring Binary Mechanisms in Transformer Circuits
Saraipour, Karim, Zhang, Shichang
Transformer-based language models (LMs) can perform a wide range of tasks, and mechanistic interpretability (MI) aims to reverse engineer the components responsible for task completion to understand their behavior. Previous MI research has focused on linguistic tasks such as Indirect Object Identification (IOI). In this paper, we investigate the ability of GPT-2 small to handle binary truth values by analyzing its behavior with syllogistic prompts, e.g., "Statement A is true. Statement B matches statement A. Statement B is", which requires more complex logical reasoning compared to IOI. Through our analysis of several syllogism tasks of varying difficulty, we identify multiple circuits that mechanistically explain GPT-2's logical-reasoning capabilities and uncover binary mechanisms that facilitate task completion, including the ability to produce a negated token not present in the input prompt through negative heads. Our evaluation using a faithfulness metric shows that a circuit comprising five attention heads achieves over 90% of the original model's performance. By relating our findings to IOI analysis, we provide new insights into the roles of specific attention heads and MLPs in LMs. These insights contribute to a broader understanding of model reasoning and support future research in mechanistic interpretability.
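The faithfulness figure quoted above (over 90% of the original model's performance from five heads) is typically obtained by ablating every attention head outside the candidate circuit and comparing the resulting task metric to the full model's. A minimal sketch of that comparison is below; `ablated_model_metric` is a hypothetical model wrapper, the mean-ablation and logit-difference choices are assumptions, and the head indices in the example are placeholders rather than the circuit reported in the paper.

```python
from typing import Callable, List, Tuple

Head = Tuple[int, int]  # (layer, head)


def faithfulness(
    full_model_metric: float,
    ablated_model_metric: Callable[[List[Head]], float],  # hypothetical: task metric with the given heads ablated
    all_heads: List[Head],
    circuit: List[Head],
) -> float:
    """Fraction of the full model's task performance recovered by the circuit alone.

    Assumes the task metric is a logit difference (correct minus incorrect completion)
    and that heads outside the circuit are mean-ablated.
    """
    outside = [h for h in all_heads if h not in circuit]
    return ablated_model_metric(outside) / full_model_metric


if __name__ == "__main__":
    heads = [(l, h) for l in range(12) for h in range(12)]
    circuit = [(9, 6), (9, 9), (10, 0), (10, 7), (11, 10)]  # placeholder head indices only

    # Toy stand-in: pretend ablating all heads outside the circuit costs 0.28 of a 3.5 logit diff.
    toy_metric = lambda ablated: 3.5 - 0.28 if len(ablated) > 0 else 3.5
    print(f"faithfulness = {faithfulness(3.5, toy_metric, heads, circuit):.2%}")
```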
Mechanistic Interpretability in the Presence of Architectural Obfuscation
Florencio, Marcos, Barton, Thomas
Architectural obfuscation - e.g., permuting hidden-state tensors, linearly transforming embedding tables, or remapping tokens - has recently gained traction as a lightweight substitute for heavyweight cryptography in privacy-preserving large-language-model (LLM) inference. While recent work has shown that these techniques can be broken under dedicated reconstruction attacks, their impact on mechanistic interpretability has not been systematically studied. In particular, it remains unclear whether scrambling a network's internal representations truly thwarts efforts to understand how the model works, or simply relocates the same circuits to an unfamiliar coordinate system. We address this gap by analyzing a GPT-2-small model trained from scratch with a representative obfuscation map. Assuming the obfuscation map is private and the original basis is hidden (mirroring an honest-but-curious server), we apply logit-lens attribution, causal path-patching, and attention-head ablation to locate and manipulate known circuits. Our findings reveal that obfuscation dramatically alters activation patterns within attention heads yet preserves the layer-wise computational graph. This disconnect hampers reverse-engineering of user prompts: causal traces lose their alignment with baseline semantics, and token-level logit attributions become too noisy to support prompt reconstruction. At the same time, feed-forward and residual pathways remain functionally intact, suggesting that obfuscation degrades fine-grained interpretability without compromising top-level task performance. These results establish quantitative evidence that architectural obfuscation can simultaneously (i) retain global model behaviour and (ii) impede mechanistic analyses of user-specific content. By mapping where interpretability breaks down, our study provides guidance for future privacy defences and for robustness-aware interpretability tooling.
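The obfuscation maps the abstract refers to are fixed invertible transforms applied to internal tensors. The sketch below shows the simplest case, a hidden-dimension permutation, and why the layer-wise computational graph survives while coordinates no longer line up for an observer who does not hold the permutation; it is illustrative only, not the paper's exact construction.

```python
import torch

torch.manual_seed(0)

d_model = 8
perm = torch.randperm(d_model)   # private obfuscation map: a permutation of hidden dimensions
inv_perm = torch.argsort(perm)   # its inverse, needed to undo the scrambling

# A stand-in "layer": a fixed linear map on the hidden state.
W = torch.randn(d_model, d_model)
x = torch.randn(d_model)         # original hidden state

# Obfuscated computation: permute the state and rewrite the weights in the permuted basis.
W_obf = W[perm][:, perm]         # conjugate the weight matrix by the permutation
y_obf = W_obf @ x[perm]

# The computational graph (a single matmul) is unchanged, and the output matches the
# baseline up to the private permutation; but neuron/head coordinates no longer align
# with the original model, which is what degrades activation-level interpretability.
y_plain = W @ x
print(torch.allclose(y_obf[inv_perm], y_plain, atol=1e-6))  # True
```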
ContextBench: Modifying Contexts for Targeted Latent Activation
Graham, Robert, Stevinson, Edward, Richter, Leo, Chia, Alexander, Miller, Joseph, Bloom, Joseph Isaac
Identifying inputs that trigger specific behaviours or latent features in language models could have a wide range of safety use cases. We investigate a class of methods capable of generating targeted, linguistically fluent inputs that activate specific latent features or elicit model behaviours. We formalise this approach as context modification and present ContextBench, a benchmark with tasks assessing core method capabilities and potential safety applications. Our evaluation framework measures both elicitation strength (activation of latent features or behaviours) and linguistic fluency, highlighting how current state-of-the-art methods struggle to balance these objectives. We enhance Evolutionary Prompt Optimisation (EPO) with LLM assistance and diffusion-model inpainting, and demonstrate that these variants achieve state-of-the-art performance in balancing elicitation effectiveness and fluency.
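The trade-off the benchmark scores, elicitation strength versus linguistic fluency, can be pictured as a single scalar objective over candidate contexts. The sketch below shows one plausible scoring rule, a weighted combination of a latent-feature activation and a perplexity-based fluency penalty; the `elicitation` and `log_perplexity` functions and the `fluency_weight` are illustrative assumptions, not ContextBench's actual definitions.

```python
import math
from typing import Callable


def context_score(
    context: str,
    elicitation: Callable[[str], float],     # hypothetical: target feature activation or behaviour probability
    log_perplexity: Callable[[str], float],  # hypothetical: fluency proxy from a reference LM
    fluency_weight: float = 0.5,
) -> float:
    """Higher is better: strong elicitation and low perplexity (fluent text)."""
    return elicitation(context) - fluency_weight * log_perplexity(context)


if __name__ == "__main__":
    # Toy stand-ins so the example runs without a model.
    elicit = lambda s: float("trigger" in s)  # pretend the target feature fires on the word "trigger"
    log_ppl = lambda s: math.log(1.0 + 50.0 / max(len(s.split()), 1))  # shorter text scores as less fluent here
    candidates = ["a fluent sentence containing the trigger word", "trigger trigger trigger"]
    best = max(candidates, key=lambda c: context_score(c, elicit, log_ppl))
    print(best)
```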
Reversing the Forget-Retain Objectives: An Efficient LLM Unlearning Framework from Logit Difference
A conventional LLM unlearning task typically involves two goals: (1) the target LLM should forget the knowledge in the specified forget documents; and (2) it should retain the other knowledge that the LLM possesses, for which we assume access to a small number of retain documents. To achieve both goals, a mainstream class of LLM unlearning methods introduces an optimization framework that combines two objectives: maximizing the prediction loss on the forget documents while minimizing it on the retain documents. This approach suffers from two challenges: degenerated output and catastrophic forgetting. In this paper, we propose a novel unlearning framework called Unlearning from Logit Difference (ULD), which introduces an assistant LLM that aims to achieve the opposite of the unlearning goals: remembering the forget documents and forgetting the retain knowledge. ULD then derives the unlearned LLM by computing the logit difference between the target and the assistant LLMs. We show that these reversed objectives naturally resolve both aforementioned challenges while significantly improving training efficiency.
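The core operation the abstract describes, deriving the unlearned model's next-token distribution from the logit difference between the target LLM and the assistant LLM, can be sketched in a few lines. The plain element-wise subtraction and the weight `alpha` are assumptions for illustration; the paper's exact combination rule may differ.

```python
import torch


def unlearned_logits(target_logits: torch.Tensor,
                     assistant_logits: torch.Tensor,
                     alpha: float = 1.0) -> torch.Tensor:
    """Combine the target LLM (keeps all knowledge) with an assistant LLM trained on the
    reversed objectives (remember the forget set, forget the retain knowledge).

    Subtracting the assistant's logits suppresses exactly what the assistant remembers,
    i.e. the forget documents, while leaving retained knowledge largely intact.
    The weight `alpha` is an illustrative knob, not a value from the paper.
    """
    return target_logits - alpha * assistant_logits


if __name__ == "__main__":
    target = torch.tensor([2.0, 0.5, -1.0, 0.0, 1.5])        # target LLM favours token 0
    assistant = torch.tensor([3.0, -1.0, -1.0, -1.0, -1.0])  # assistant strongly remembers token 0 (forget knowledge)
    probs = torch.softmax(unlearned_logits(target, assistant), dim=-1)
    print(probs)  # probability mass shifts away from token 0, the "forgotten" answer
```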